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Abstract — Covariance matrix estimates are an essential part 
of many signal processing algoritlims, and are often used to 
determine a low-dimensional principal subspace via their spectral 
decomposition. However, exact eigenanalysis is computationally 
intractable for sufficiently high-dimensional matrices, and in the 
case of small sample sizes, sample eigenvalues and eigenvectors 
are known to be poor estimators of their population counterparts. 
To address these issues, we propose a covariance estimator that 
is computationally efficient while also performing shrinkage on 
the sample eigenvalues. Our approach is based on the Nystriim 
method, which uses a data-dependent orthogonal projection 
to obtain a fast low-rank approximation of a large positive 
semidefinite matrix. We provide a theoretical analysis of the error 
properties of our estimator as well as empirical results, including 
examples of its application to adaptive beamforming and image 
denoising. 

Index Terms — low-rank approximation, covariance shrinkage, 
Nystrom extension, adaptive beamforming, image denoising 



I. Introduction 

THE need to determine a principal subspace containing 
some signal of interest arises in many areas of signal 
processing, including beamforming speech processing 
Q, and source separation |3|. Typically, subspace estimation 
involves computing the spectral decomposition of a covariance 
matrix that has been estimated using a set of observed data 
vectors. The estimator most commonly used in practice is 
the sample covariance, often preferred because it is simple 
to compute and has well-understood theoretical properties. 

When solving the subspace estimation problem, one faces 
two critical challenges. The first is that for p-dimensional 
data, the computational cost of the full spectral decomposition 
scales as 0{p^). In the case of high-dimensional data sets, 
or when the subspace estimation problem needs to be solved 
many times, obtaining the eigenvalues and eigenvectors of 
the sample covariance becomes a computational bottleneck. 
For this reason, algorithms have been developed to obtain 
approximate solutions to the eigenvalue problem Q, |j5|. 

The second challenge is that the eigenvalues and eigenvec- 
tors the sample covariance matrix are known to be be poor 
estimates of the true eigenvalues and eigenvectors, especially 
when operating in high dimensions with limited observations 
|[6), ||7|. In particular, the sample eigenvalues are known to 
be over-dispersed (relative to the true spectrum), and many 
researchers have focused on developing shrinkage estimators 
that yield improved estimation results |8)-|[TT]. 
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States Government. 



Instead of addressing these challenges separately, we pro- 
pose that by solving them concurrently, one can perform 
both tasks at a reduced computational cost. To this end, we 
develop an estimator based on the Nystrom method p2)-p4), 
a matrix approximation technique that uses a data-dependent 
orthogonal projection to approximate a positive semidefinite 
matrix. This approach leads to an estimator that not only 
admits computationally efficient spectral analysis, but also 
shrinks the eigenvalues of the sample covariance. 

We begin by formulating the covariance estimation problem 
and reviewing the Nystrom method for matrix approximation. 
We then develop its use as a covariance estimator, including a 
study of its error characteristics. We conclude with examples 
of the use the Nystrom covariance estimator in two practical 
applications: adaptive beamforming and image denoising. 

II. The Covariance Estimation Problem 

Let X be a p X n matrix whose columns xi, . . . , x„ are 
independent and identically distributed (i.i.d.) samples from 
an unknown p-variate distribution. Throughout the following, 
we assume Xi, . . . , x„ have zero mean and a finite covariance, 
denoted by the p x p positive semidefinite matrix 

The basic problem of covariance estimation is straightfor- 
ward: given X, we wish to construct an estimator S of S. 
As a function of random data, S is itself random, and thus 
its performance as an estimator is best understood through its 
statistical properties (conditional on the true covariance). In 
particular, we will be concerned with the bias matrix 



and the mean squared error (MSE) 



MSE(S S) 



E 



where ||-|| is a suitable matrix norm. A common choice of 
norm is the Frobenius norm, defined for a real matrix A as 
||A||j^ — [tr(A"^A)]^/^. This norm is used throughout the 
covariance estimation literature | [TT| , fTS] , and will be the 
primary one featured in our analysis. 

The Sample Covariance 

The most common covariance estimator is the sample 
covariance matrix. 



1=1 



1. 



-XX^ 



We refer to a p X p matrix S as positive semidefinite (denoted S >; 0) 
if it is symmetric and x^Sx > for all x S K", and as positive definite 
(denoted S ;^ 0) if it is positive semidefinite with x-^ Sx = if and only if 
X = 0. 
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This estimator has a number of qualities that make it a popular 
choice among practitioners. For example, it is unbiased, and 
its computational cost of 0{p^n) is not excessively expensive. 
When Xi, . . . ,x„ are i.i.d. samples from the p-variate normal 
distribution with mean and positive definite covariance S — 
denoted 7V^(0, S) — it corresponds to the maximum-likelihood 
estimator of S given the data. We can also compute its MSB 
with respect to the Frobenius norm, given by 



MSE(S |S) = - [tr(E2) +tr2(S)] 



(1) 



Despite these convenient properties, there are characteristics 
of S that make it unsuitable for many applications. For 
example, if n < p, then S is guaranteed to be rank-deficient; 
even if we have prior reason to believe that S is invertible, its 
estimate will be singular. Another issue is the over-dispersion 
of the sample eigenvalues. Let Ai(S). . . . , A)3(S) denote the 
eigenvalues of S in nonincreasing order. For fixed p and large 
n, the sample eigenvalues are reasonable estimators of the true 
eigenvalues of S, and it can be shown |16J that as n — ^ cx) 
with p fixed, \i{S) converges almost surely to Ai(S) for 
i = 1, . . . ,p. However, when p is allowed to grow with n 
(keeping the ratio n/p fixed), results such as the celebrated 
Marckenko-Pastur law suggest that the sample eigenvalues 
are not effective estimators, and fail to converge to the true 
eigenvalues |6j, ||7]. 

Shrinkage Covariance Estimators 

In the context of covariance estimation, shrinkage estima- 
tion involves compensating for the known over-dispersion of 
sample eigenvalues, in order to improve error performance 
and numerical stability. One common approach to covariance 
shrinkage is to preserve the sample eigenvectors, but alter 
the sample eigenvalues to improve error performance with 
respect to a given loss function |[8|, pO) . Although shrinkage 
estimators of this form often come with analytical guarantees 
regarding error performance, they do so at the cost of operating 
on the spectral decomposition of S directly. Consequently, 
such estimators are impractical for large data sets. 

Another approach is to construct the estimate S as a linear 
combination of S and a known positive definite matrix |j9), 
[ [TT) . For example, the Ledoit-Wolf estimator of | jTT| takes the 
form 

S = aS + /3Ip, (2) 

where the optimal shrinkage coefficients a, /3 > are given 
by 

(a*, /?*) = argmin ||(aS + /3Ip) - . 

(",/3) 

Since the optimal coefficients can only be obtained analytically 
in the case where S is known, the authors instead develop 
consistent estimators for a and /3 based on the data. Note 
that while estimators such as (|2| have lower computational 
demands than those that operate on the eigenvalues of S 
directly, once we have obtained an estimate S, we still must 
pay the full cost of 0{p^) if we wish to obtain its principal 
components. 



Covariance Estimation Using the Nystrom Method 

Let / C be a set of k indices. The Nystrom 



covariance estimator |[T4| takes the form 



= -XP(/)X'^, 



(3) 



where P (/) represents an orthogonal projection onto the 
subspace of M" spanned by the k rows of X specified by 
the indices in /. 

Defining S in this fashion serves two important purposes. 
First, we will show that Q is equivalent to the Nystrom ap- 
proximation of the sample covariance, an established method 
for low-rank approximation of positive semidefinite matrices 
1 12 1, ([T3 1. A primary advantage of this method is its computa- 



tional efficiency, as one can obtain the k principal eigenvalues 
and eigenvectors of S for a cost that scales linearly in p and 
n. Second, the projection P shrinks the singular values of the 
data, serving to counteract over-dispersion of eigenvalues in 
the sample covariance. 

III. The Nystrom Method 

The Nystrom method is a classical technique for obtaining 
numerical solutions to eigenfunction problems. When applied 
to matrices, it can be used to construct a low-rank approxima- 
tion of a positive semidefinite matrix as follows. 

Let Q be a p X p positive semidefinite matrix, represented 
in block form as 



Q 



Ql2 



Ql2 

Q22, 



where Qj^j^ is fc x 
preserves Cl^ and 
Nystrom extension: 



fc. The Nystrom approximation of Q 
Q]^2 while approximating Q22 by its 



Q12 



Q12Q11Q12 



(4) 



where Qj^ denotes the Moore-Penrose pseudoinverse of (^n- 
Since the approximation reconstructs and Q]^2 exactly, 
the approximation error Q — Q is characterized entirely by 
the Schur complement of Qn in Q, 

Qii — Q22 Qi2QiiQi2- 

If we view Q as the outer product of an underlying 
data matrix, an alternative way to characterize the Nystrom 
approximation is as a function of an orthogonal projection. 
Let X be a p x n matrix, partitioned as 



X 



(5) 



where Y is fc x n and Z is (p — fc) x n, and let 



Q = XX^ 



YY' YZ 



Tl 



We then define the n x n symmetric idempotent matrix 

P = Y^(Y^)+=Y^(YY^)+Y, 



(6) 



which represents an orthogonal projection onto the subspace 
of M" spanned by the fc rows of Y. We obtain the same 
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expression as in Q by approximating X with its projection 
XP: 



Q = XP(XP)^ = XPX^ 



YY' 



ZY' 



YZ' 

ZY'^fYY^) 



YZ' 



This interpretation illustrates the low-rank nature of Q, as we 
must have rank(Q) < rank(P) < k. It also highlights the 
fact that the approximation need not be restricted to the first 
k rows of X; we may instead choose to construct P based on 
any subset of k rows. Since different subsets typically yield 
different approximations, we can view Q as a function of a 
set of k indices / C {1, . . . (Throughout the article, we 
we will use Ajj to denote the submatrix of a matrix A whose 
rows and columns are specified by respective index sets / and 
J, and define A/ = A//.) 

The problem of selecting a suitable subset for Nystrom 
approximation is one that has received significant attention in 
the literature. Although a number of efforts have focused on 
developing advanced subset selection methods 1 17\ , |18|, for 
many applications satisfactory performance can be achieved 
simply by choosing / randomly with uniform probability fT2^, 
p^3 |. This simpler approach has the added benefit of enhancing 
the computational gains associated with the Nystrom method, 
and will be the strategy employed for the experiments in 
Sections [V] and [Vl] 

IV. The Nystrom Covariance Estimator 

We proceed by formally defining the Nystrom covariance 
estimator, after which we derive expressions for its bias and 
MSE. We then discuss its eigenvalue shrinkage properties and 
derive expressions for its eigenvalues and eigenvectors. 

Definition 1 (Nystrom covariance estimator). Let Ji. be a pxn 

matrix whose columns xi,...,x„ are i.i.d. random vectors 
such that E(xi) = and E(xiX^) — S for i — 1,. . . ,n. 
Let ri,...,rp denote the rows of X. Given a k-subset 
/ C {1, . . . we define the Nystrom covariance estimator 
of S as 

sm = -xp(/) x^, 

n 

where P (/) represents an orthogonal projection onto the 
subspace o/M" spanned by the set of vectors {r^ : i G /}. 

As previously discussed, is a function of an index 

set / C {!,..., p}, and thus error performance will con- 
ditional on /. Although viewed here as an estimator of S, 
the Nystrom covariance estimator could be interpreted as the 
Nystrom approximation of the sample covariance S. When 
rank(P(/)) = rank^) < min(p, n), this approximation is 
exact, and we have S(/) = S. 

Error Statistics 

Assume now that the columns of X are drawn independently 
from a p-variate normal distribution with zero mean and 
covariance S ;^ 0. In this case, we can derive analytical 
expressions for the bias and expected square error of the 



Nystrom covariance estimator We begin by computing the 
expected value of after which the bias matrix follows 

as a corollary. 

Tlieorem 1 (Expected value of Nystrom covariance estimator). 

Let a. be a p X n matrix whose columns xi, . . . , x„ are i.i.d. 
random samples from A^p(0, S). Let be the Nystrom 

covariance estimator of S given a k-subset I C {!,... ,p}, 
and define J = {1, . . . ,p} \ I. Then, [E(S(/))]^ = 

[E{W))]jj = M^{I))]^jj = inij. and 



E(S(/)) 



fc „ n — 

= 

J n n 



Proof Without loss of generality, let / = {1, . . . , k} and 
J = {k + 1, . . . ,p}. Partitioning X as in (|5]l, the Nystrom 
covariance estimate is given by 



XPX' 



YY' 



YZ' 



ZY' ZPZ^ 



where P represents an orthogonal projection onto the span of 
the rows of Y. By construction, we have 



and 



E(^YZ-) = [E(iZY-)]" = Sz., 

and thus we need only compute E(iZPZ"^). To perform this 
calculation, consider the nested expectation 



EfZPZM = Ey 



E(ZPZ'^| Y) 



Using standard properties of conditional distributions of nor- 
mal random vectors, one can show that given Y, the columns 
zi, . . . , z„ of Z are independent and normally distributed as 



z« |Y 



■ Mp-k ( 



Z\Y 



where 



and 



T I 



■'Z\Y — - 

Given these distributions, evaluating E(ZPZ^ | Y) is a matter 
of performing standard moment calculations using the proper- 
ties of normal random variables. For convenience, we apply a 
result from \\9\ Theorem 2.3.5] for normal random matrices, 
which states that for a random pxn matrix X whose columns 
xi,...,x„ are distributed as x^ ^ A^p(/Xj,S), if A is a 
constant p x p matrix, then 

E(XAX^) = tr (A) S + MAM^, 

where M — [^^ ■ ■ ■ Thus, 

E(ZPZ'^ I Y) = kT.1 + SfjS7iYPY^E7iS/j , 

and 



E(ZPZ-') = Ey E(ZPZ 

=Ey fcS/ 



s7jE7iYPY^S7iS,j 
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where the final equality follows from Ey(YPY^) = nHj. 
Dividing by n yields the desired result. □ 



Corollary 1 (Bias of Nystrom covariance estimator 1 14 1). Let 
"X. be a p X n matrix whose columns xi, . . . ,x„ are Ltd. 
random samples from A/'p(0,S). Let be the Nystrom 



covariance estimator of S given a k-subset / C {1, , 
and define J = {1, . . . ,p} \ /. Then the bias matrix 

B(S(/) I S) = S -E(S(/)) 

satisfies [B]ij = /or all (i, j) ^ J x J, and 



B 



Thus, is a biased estimator of S, except in the 

case where the Schur complement S/ = 0. Recalling from 



Section III that this Schur complement also expresses the error 
between S and its Nystrom approximation, we see that 
cannot be unbiased unless it is equal to the sample covariance. 



Theorem 2 (MSE of Nystrom covariance estimator 1 14 1). Let 
X be a p X n matrix whose columns xi, . . . ,x„ are i.i.d. 
random samples from A/'p(0,S). Let be the Nystrom 

covariance estimator of S given a k-subset I C {!,... ,p}, 
and define J = {1, . . . ,p} \ /. Then the mean square error of 
the Nystrom covariance estimator in Frobenius norm is 

E||S-S(/)||^ 
= MSE(S I S) 



MSE(S I S) 



n — k 
{n~k) 

n 2 



(n - fc - l)tr(sj) -tr2(S/) 



|S,||^-MSE(S, |S) 



where MSE (S | S) is the mean square error of the sample 
covariance estimator given in ([T]), and MSE (S/ | is the 
mean square error of the sample covariance estimator of the 
Schur complement of S/ in S, given by 

1 



MSE (Sj I S) 



tr(S;)+tr2(S,) 



Proof: Let / = {1, . . . , k] and J = {fc + 1, . . . ,p} with- 
out loss of generality. The MSE in Frobenius norm is 



Ells - 



tr 



tr 



E 



2tr 



(n2 



EE ( iXPX^ 



XPX^XPX^ 



(7) 



Substituting the expression for E ^^XPX^^j from Theo- 
rem [T] we have 



EE I -XPX' 

n 



(8) 



= tr(S?) +2tr(S/ 
+ ^tr( 

To compute E (^^^XPX^XPX^), let X be partitioned as 
in (|5]l, so that 

tr(XPX^XPX^) = tr (YY'^YY^) + 2 tr(YZ'^ZY^) 
+ tr(ZPZ^ZPZ^). (9) 



As in the proof of Theorem [T] the expectation of each term 
can be evaluated using standard properties of normal random 
vectors. However, we may simplify the analysis using a result 
from p9] Theorem 2.3.8], which states that for a random pxn 
matrix X whose columns xi , . . . , x„ are distributed as x; ^ 
Afp (/Xj, S), if A, B, and C are independent nxn, pxp, and 
n X n matrices (respectively), then 

E (XAX'^BXCX^) 

= ti-(C^A^)ti- (BE) S + tr (A) tr (C) SBS 
+ tr(AC'^) SB^S + tr (C) MAM^BS 
+ MAC'^M^B^S + tr(AM^BMC)S 
+ tr (BE) MACM'^ + SB^MA'^CM'^ 
+ tr (A) SBMCM'^ + MAM^BMCM"^, 

where M = [/Xj^ • • • We can compute the first term in 
(|9| by applying this formula directly; for the remaining two 
terms, we must use iterated expectation as we did in the proof 
of Theorem [T] Evaluating these expectations yields 



Etr(YY^YY^) = (r 



tr(S?) +ntr2(S/), (10) 



Etr (YZ'^ZY^) = (n^ + n) ti-(S/jSfj 
+ ntr(S/)tr(Sj), 



(11) 



and 



Etr(ZPZ^ZPZ^) = + k) tr(Sj) +ktx'^(T,i) 
+ 2n(A; + l)tr(S/R) 
+ 2n(A; + l)tr(S/)tr(R) 
+ (n^ +n)tr(R2) +fctr2(R), (12) 

where R = SfjE7^S/,/. Substituting ^ and ([T0|-([T2| into 
(|7]l and simplifying terms yields the result. □ 



Discussion of Error Results 

Given an arbitrary covariance S, we can derive a lower 
bound for the MSE of the Nystrom covariance estimator as 
follows. First, note that when applied to the spectrum of a 
positive semidefinite matrix A, the Cauchy-Schwarz inequality 
implies that tr^ (A) < tr(A^) rank (A). For A = S/, substi- 
tuting this inequality into the error expression in Theorem |2] 
yields 

Ells - S(/) 11^ > MSE (S I S) + (n - fc - 1) tr(s') 



^ti-(S^) rank(S^) 



MSE(S |S) 



(n— fc)(n— p— 1) 



tr S, 



where rank(Sj) = p—k. Thus, n < p is a necessary condition 
for the MSE of the Nystrom covariance estimator to be less 
than that of the sample covariance. 

We can show that equality is achieved for this bound 
when S = Ip. In this case, S/ = Ip-k for all fc-subsets 
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/ C {1, . . . and thus 

|2 



E S- 



MSE(S |S) + ^^(n-fc-l)(p-fc) 



= MSE(S |S) 



{n — k){p~k)(n~'p~l) 



where MSE (S | S) = [p^+pj/n. Since the second term in the 
above summation is negative if and only if n — p — 1 < 0, we 
see that when estimating the identity covariance, the Nystrom 
estimator achieves better error performance than the sample 
covariance for any n < p. 

This behavior may seem surprising, especially when noting 
that n < p implies that the sample covariance will be of 
rank n with probability one, while the Nystrom estimator 
will be of rank k < n. However, eigenvalue results such 
as the Marcenko-Pastur theorem indicate that in this regime 
the sample covariance will be over-dispersed. Even though 
the Nystrom estimator has fewer nonzero eigenvalues than 
does the sample covariance, the shrinkage it provides on these 
eigenvalues compensates for its lower rank, resulting in better 
overall error 

Eigenvalue Shrinkage 

The error performance of the Nystrom covariance estimator 
derives from its ability to shrink the over-dispersed eigenvalues 
of the sample covariance. One way to understand this property 
is as a consequence of the well-known eigenvalue inequalities 



of Weyl, a form of which |20 Corollary III.2.3] states that for 
any px p symmetric matrix A and pxp positive semidefinite 
matrix B, 

A,(A + B) > A,(A), (13) 

for i = 1, . . . ,p. 

Let S ^ Obeapxp sample covariance matrix, and let 
be the corresponding Nystrom covariance estimator given a k- 
subset / C {l,...,p}. As in previous discussions, we may 
let / = {l,...,k} without loss of generality. By positive 
semidefiniteness of the Schur complement, the error matrix 



E; 





Si 



is also positive semidefinite. Thus, ( [T3| ) implies that 

A,(S(/)+E) = A,(S) > A,(S(/)), 

for i = l,...,p. In other words, given any fc-subset, the 
Nystrom covariance estimator shrinks the eigenvalues of the 
sample covariance toward zero. 

Calculation of Eigenvalues and Eigenvectors 

We now derive expressions for the eigenvalues and eigen- 
vectors of the Nystrom covariance estimator For notational 
convenience, given an m x n matrix A and a fc-subset 
/ C {!,..., in}, let Aj: denote the k x n submatrix formed 
by taking k entire rows of A as indexed by /. Also, given 
r = rank (A), we define the thin singular value decomposition 
(thin SVD) of A as A = UDV^, where D IS an r X r 



diagonal matrix containing the nonzero singular values of A, 
and U and V are to x r and n x r matrices whose columns 
are the corresponding left and right singular vectors. 

Using this notation, the eigenvalues and eigenvectors of the 
Nystrom covariance estimator can be expressed as follows. 

Theorem 3 (Eigenvalues and eigenvectors of the Nystrom 
covariance estimator). Let 'X.be a pxn matrix whose columns 
Xi,...,x„ are i.i.d. random vectors such that E(xi) = 
and E(xixf) = S for i = Given a k-subset 

I C {1, . . . ,p}, define J = {1, . . . ,n}\I, let r = rank(X/.), 
and let X/^ have the thin SVD X/^ = UxDxV^. If W is 
the p X r matrix given by 



1 



:X.^V 



J: VX, 



then WW — is the Nystrom covariance estimator of 

S given I, and its r nonzero eigenvalues and corresponding 
eigenvectors are given by A^ and U, where W = UAV"'" is 
the thin SVD ofW. 

Proof: As in previous proofs, we let / = {!,..., fc} 
and J — {k + 1, . . . ,p} without loss of generality. Let X be 
partitioned as in (|5]), and let Y = X/: have the thin spectral 
decomposition Y = UyDyVy. Letting 



W 



UyDy 

ZYy 



we have 



WW' 



UyDy 

ZVy 



DyU^ 



UyD^Uy 



ZVyDvU 



UyDyV|;Z^' 
ZVyV^Z^ 



YY' 

zy"^ 



YZ' 

ZVyV^Z^ 



Noting that the Moore-Penrose pseudoinverse of Y is 



rT\ + 



we see that Nystrom projection of (|6]l is equivalent to 

P EE Y'^(Y^)+ = VyDyU^UyDy^V^ = YyVy, 

and thus WW^ = ^(/). Consequently, if UAV^ is the thin 
SVD of W, then 

£(/) = UAV'^VAU^ = UA^U'^ 

is the (thin) spectral decomposition of where A^ is an 

r X r matrix containing the nonzero eigenvalues of and 
U is a p X r matrix whose columns are the corresponding 
eigenvectors. □ 
For fixed fc, the computational complexity of the spectral 
decomposition of S(/) is dominated by two operations. The 
first is the multiplication ^j-Vx, the cost of which scales as 
0{pn)\ the second is the thin SVD of W, the cost of which 
scales as 0{p). Thus, for n fixed the overall cost of eigen- 
analysis scales linearly in the dimension p, making Nystrom 
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d = 



Figure 1. A simple narrowband beamforming model. 

covariance estimator an appealing choice in extremely high- 
dimensional settings. 

V. Example Application: Adaptive Beamforming 

We conclude our discussion with two examples of the use 
of Nystrom covariance estimation in practical applications. For 
the first example, we examine the classical signal processing 
problem of beamforming pTj-|[24), which involves tailoring 
the signal response (or "beam pattern") of an array of receiving 
elements to enhance one's ability to detect a desii-ed signal. 

Narrowband Beamforming Model 

We adopt a standard beamforming model, illustrated in 
Figure [T] Consider a collection of p sensing elements, which 
sample an incoming plane-wave signal at discrete time in- 
tervals. We assume the signal of interest is narrowband and 
that the sensors are placed in a straight line with equal 
spacing (known as a uniform linear array). To avoid aliasing 
in the spatial sampling of the signal, we assume an element 
spacing of d = Ac/2, where Ac is the carrier wavelength. In 
addition, let denote the angle between the wave's direction 
of propagation and a vector normal to the array, referred to as 
the angle of arrival. 

Assume that there are A; < p incoming signals, and let 
Zi{t) S C denote the complex envelope of the i-th signal at 
the t-lh sample time, for i G {1, . . . , fc} and t G {1, 2, . . . }. 
Arranging the signal response for all array elements into 
a vector, the total received signal at the i-th sample time 
(referred to as a "snapshot") is 

k 

x(t) = ^a(0,:)^»(i) + n(t), (14) 

i=l 

where n{t) e is additive noise and where a(0i) e 
represents the amplitude change and phase delay at each sensor 
as a function of 9i, the angle of arrival of the «-th signal. 
Assuming a constant amplitude response across all elements 
and letting the phase response at the first element be zero, we 
have 

for Z = 1 , . . . , p and where j = \/— T. 

Now, let us assume that out of the k incoming signals, 
only one — say, zi{t) — is of interest to us, and all others are 



considered interference. In this context, the classical narrow- 
band beamforming problem can be stated as follows: given 
a collection of n snapshots {x(l), . . . , x(n)}, determine a 
weight vector w e such that the output of the linear filter 
(or "beamformer") 

Zl{t) = w^x(t) 

such that the mean squared error of zi (t) — zi (t) is small. 

For our simple example, let us assume that given the narrow- 
band beamforming model in ( [T4] i, the signals zi{t), . . . , Zm (t) 
and the noise n{t) are stationary zero-mean Gaussian random 
processes that are statistically independent across time sam- 
ples. Let af — E(^zf{t)) denote the expected power of the 
i-th signal, and let af^ ~ E(n^(t)) denote the expected noise 
power. In this case, it can be shown p4| , p5) that the optimal 
beamformer (in the sense of minimum MSB) is given by 

Wopt = min E (zi{t) - w^x(t))^ = S"^a(6li) aj, (15) 

where 

fe 

S = E (x(t) x"{t)) = '^f ^(^») ^''(^*) + Ip • 

i=l 

Unfortunately, there are a number of barriers to realizing 
this optimal beamformer in practice. Even if our modeling 
assumptions hold, we need to know the power af and angle 
of arrival 9i of the signal of interest, as well as the covariances 
of the interference and noise. 

Beamforming Using the Nystrom Covariance Estimator 

Let assume that al and 9i are known. In this case, we might 
consider approximating S by the sample covariance 

S=^f]x(t)x^(t), 

t=i 

given a collection of n observations. Although this approach 
will recover the optimal beamformer as n — oo, in practice 
the number of samples is bounded, and any assumption of 
stationarity regarding the signal and interference usually is 
valid only over a limited time window. This issue is especially 
problematic when the number of elements p is large, as we 
may not have enough samples for S to be well-conditioned (or 
even invertible). As a result, direct substitution of the sample 
covariance into ([T5| is rarely an acceptable solution. 

One alternative is to replace S with a low-rank approxima- 
tion fT^, f2^-f2E\. For example, consider the optimal rank-m 
approximation 

SI = UfcA.Uf , 

where Aj. is a fc x fc diagonal matrix containing the k largest 
eigenvalues of S, and U^: is a p x A: matrix containing 
the corresponding eigenvectors. We can define a "projection 
beamformer" (denoted Wproj) by approximating ( fTS) as 

Wp,oj = (S^)+a(0i)a2, (16) 

where 

(S^)+ = U,A^iuf. 
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Consequently, the filter output can be expressed as 

z,it) = Wp^„jX(i) - alsL^ie,) (S^)+Px(i), 

where P = UmU^ is an orthogonal projection onto the 
TO-dimensional principal subspace of S. Thus, the projec- 
tion beamformer maps the incoming signal onto a low-rank 
subspace, and then approximates the behavior of optimal 
beamformer within this space. If the range of P is close to 
the span of {a{9i)zi{t), . . . ,a{9m)zm{t)}, any signal power 
lost in the low-rank projection will be largely due to noise, 
thus improving overall estimation performance. Of course, 
the effectiveness of this approach in practice depends on the 
powers of the signal and interference relative to the noise, as 
well as the sample size. 

The projection beamformer of ( [T6] l presents an excellent 
opportunitj^ to apply the Nystrom covariance estimator. Sub- 
stituting for the optimal low -rank approximation of 
the sample covariance, we define the "Nystrom beamformer" 
(denoted WNyst) as 



WNyst= (S(/))+a(0i)a2 



(17) 



given a fc-subset / C In contrast to to the 

projection beamformer, the Nystrom beamformer characterizes 
the signal covariance only over a subset of fc < p sensors in 
the array; the rest of the covariance is inferred as a function 
of observed correlations between elements in this subset and 
the remaining sensors. The goal of this approach is to achieve 
performance comparable to that of the projection beamformer, 
while realizing significant reductions in computational cost. 

Experimental Results 

To compare the performance of various beamforming ap- 
proaches, we simulated a uniform linear array with p — 100. 
We considered a case with k — 7 signals, where the angle 
of arrival of desired signal was 10 degrees, and the six 
interference signals had angles of arrival of —65, —30, —25, 
30, 45, and 60 degrees. For all experiments, we assumed a 
constant noise power of cr^ = 1 and a constant interference- 
to-noise ratio (INR) of 20 dB. 

Given a signal-to-noise ratio (SNR) for the desired signal, 
we studied the performance of each method as a function of the 
number of snapshots n. Our primary measure of performance 
was the signal to interference and noise ratio (SINR) of the 
estimated signal, defined as 

SINR = , 

where zi(t) = w^x(t) and z(t) is the sum of the received 
interference and noise at the t-th sample time. 



i=2 



Figure |2] shows the performance of various beamformers for 
three different SNR levels (-10 dB, 10 dB, and 30 dB). For 
each beamformer, we plot the SINR in dB as a function of the 
number of snapshots, averaged over 1000 experimental trials. 
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Figure 2. SINR as a function of number of snapshots for various beamform- 
ing approaches, given an INR of 20 dB and SNR values of —10 dB (top), 
10 dB (middle), and 30 dB (bottom). 



Note that the horizontal axis is measured on a logarithmic 
scale, ranging from n = 10 to n = 10^ samples. 

Results are shown for projection beamformer of ( [T6] l and 
the Nystrom beamformer of ( fTT] ), where the latter is computed 
using a uniformly random fc-subset / C {l,...,p}. For 
comparison, we include two approximations of the optimal 
beamformer: one where we have replaced S in ([T5| with the 
sample covariance, and another where we have substituted the 
Ledoit-Wolf covariance estimator of pT| . We also show an 
upper bound given by the theoretical SINR of the optimal 
beamformer, 



SINRop, 



E |wg.x(^) 



wg,Sw 



opt 



opt 



where = E(z(t) z^(i)) is the covariance of the interfer- 
ence plus noise. 

Since the purpose of the Nystrom beamformer is to achieve 
satisfactory error performance for a low computational cost, 
we also compared the running time of each algorithm. For the 
low-SNR case (—10 dB), Figure |3] shows the average CPU 
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Figure 3. Average running time in seconds of various beamformers for INR 
= 20 dB and SNR = — 10 dB, as a function of the number of snapshots. 



time required to compute each beamformer on a personal 
computer with an Intel dual-core 2.33-GHz processor, as a 
function of the number of snapshots. 

In the low-SNR case, the SINR performance of the Nystrom 
beamformer is comparable to that of the projection beam- 
former, with the former trailing the latter by a margin of 
0.4 to 1.6 dB. Both low-rank methods perform considerably 
better than beamforming using the sample and Ledoit-Wolf 
covariance estimators. We see that the sample covariance 
beamformer (which is undefined until the sample covariance 
becomes invertible at n = 100 snapshots) exhibits the poorest 
performance, due to the over-dispersion of its eigenvalues. 
Although shrinkage provided by the Ledoit-Wolf beamformer 
does improve the conditioning of the sample covariance beam- 
former, both approaches remain inferior to the low-rank meth- 
ods until the number of snapshots grows large (n > 4000). 

We observe similar performance trends in the medium-SNR 
(10 dB) and high-SNR (30 dB) cases. At an SNR of 10 dB, the 
SINR performance of the Nystrom beamformer lags behind 
that of the projection beamformer by a margin of 0.06 to 
1.4 dB, and at an SNR of 30 dB, the difference is minor 
(less than 0.15 dB) across all values of n. In both cases, the 
SINR performance of the Ledoit-Wolf and sample covariance 
estimators is consistently about 10 to 20 dB less than that of 
the low-rank methods. 

In terms of computation, the Nystrom beamformer requires 
about an order of magnitude less time to run than the pro- 
jection and Ledoit-Wolf methods, for n up to around 100. 
However, as n grows large with p fixed, the complexity 
of all four approaches is dominated by the 0{n) cost of 
computing covariance terms over the set of snapshots, and 
thus the computational differences between the beamformers 
become less dramatic. Alternatively, if we were to let p grow 
while keeping n fixed, the computational cost would instead be 
dominated by the low-rank approximation step (or by matrix 
inversion, in the case of the Ledoit-Wolf or sample covariance 
beamformers). In this case, the Nystrom beamformer would 
continue to exhibit significant computational savings when 
compared to the other methods. 

VI. Example Application: Image Denoising 

For our second example of an application for Nystrom 
covariance estimation, we consider the problem of image 



denoising p9)-pl). Let z e M™ represent an 8-bit grayscale 
image with m pixels, where each element € {0, . . . , 255} is 
the intensity of the i-th image pixel for i — 1, . . . , to. Assume 
that we obtain a noisy version of this image x = z + n, where 
n ^ Nm (O, fj^Im). Given x, we wish to compute an estimate 
z of the clean image z. 

Many approaches to image denoising involve computing 
local decompositions of the noisy signal over sets of nearby 
pixels (or "image patches"). We will investigate a denoising 
solution based on principal components analysis (PCA) of 
groups of patches, which requires estimating and then decom- 
posing local covariance matrices. 

Subspace Estimation for Image Denoising 

We begin by developing a simple image model allowing us 
to perform denoising as a subspace estimation problem. Since 
natural images often possess a high degree of local similarity, 
a common assumption in image processing is that given a spa- 
tially proximate set of pixels, there exists some transformation 
under which these pixels admit a sparse representation [29 J , 
|32|. Let us partition z into q sub- vectors zi, . . . , z,, where 
each Zi e is a local set of p pixels (referred to as a "patch"), 
with m = pq. The observed image patches are Xj = z^ + n^, 
where n,; ^ Afp (O, cr^Ip) for i — 1, . . . ,q. Assume now that 
each patch Zj is restricted to a subspace of dimension fcj < p, 
which we denote Si. In this case, it can be shown | |33J that a 
linear least-squares estimator of Zi given x^ is 



P,x, 



(18) 



where represents an orthogonal projection onto Si. This 
estimator preserves Zj, while removing all noise except for 
the component in the signal subspace. 

In practical applications, the subspaces Si, . . . , Sq typically 
are not known and must be estimated from the noisy image. 
Consider a set of n noisy patches {x^ : i G I}, which lie 
within a region of the image defined by a set of n patch indices 
/ C {1, . . . , g}. If the patches in this region have similar signal 
characteristics, then we may assume that all Si are equal for 
i G I. This assumption allows us to estimate the subspaces 
from the ki principal components of the sample covariance 
matrix 

s = i5]x.xr 

Note that in practice the subspace dimension is also unknown 
and may vary across regions. To address this problem, one may 
attempt to solve the rank estimation problem of determining 
ki from the noisy image. However, for our simple example we 
set ki to a fixed value across all images. 

By repeating the component analysis for all regions, we can 
obtain a full set of projection matrices for computing the image 
estimate. We will refer to image estimation as in ( fTSj ) where 
the projections are estimated from the principal components 
of sample covariances as the PCA image denoiser. 

Depending on the size of the image, estimating the full 
set of orthogonal projections for PCA denoising can require 
computing and then decomposing or inverting a large number 



Figure 4. High-resolution test images for image denoising experiments. Top left: flower (2256 X 1504). Top right: leaves (3008 X 2000). Bottom left: hdr 
(3072 X 2048). Bottom right: deer (4032 X 2640). 



of covariance matrices. Consequently, we may realize signif- 
icant improvements in computation by replacing the sample 
covariance with a Nystrom covariance estimate, which we will 
refer to as the Nystrom image denoiser. 

Although we have defined both approaches for disjoint 
image patches and regions, one may want to allow for some 
amount of overlap among these sets. This modification in- 
creases the number of available samples, while also mitigating 
some of the artifacts that occur at patch boundaries. While 
commonly used in practice f29l, f30l, note that allowing for 
overlapping patches conflicts with independence assumptions 
of our additive noise model. It also increases computation due 
to the additional number of patches, and because estimated 
patches may need to be reweighted before they are combined. 

Experimental Results 

To compare the PCA and Nystrom image denoisers, we ex- 
amined their performance when applied to a selection of 8 -bit 
high-resolution test images from |34|. The four images used 
are shown in Figure |4] Our primary measure of performance 
is the peak signal-to-noise ratio (PSNR) of the estimated 
image, a standard metric used throughout the image processing 
literature. The PSNR of the denoised image z is defined as 



where Zmax denotes the maximum allowable pixel value (in 
our case, 255). After generating noisy versions of each image 
for noise levels of cr = 10, 20, and 50, we attempted to 



reconstruct the original image using the PCA and Nystrom 
approaches, assuming a fixed subspace dimension of fc = 4. 

By examining denoising results for different patch and 
region sizes over the selection of test images, we determined 
a set of default parameter values that yielded a reasonable 
balance of performance and computation. For both algorithms, 
we divided the image into regions of 32 x 32 pixels, with 
adjacent regions having 50% overlap. We then divided each 
region into 8x8 patches, also with 50% overlap. Thus, each 
region contained rt = 49 patches, which were used to estimate 
a local covariance of dimension p = 64. 

In the case of the Nystrom denoiser, covariance estimation 
was performed for each region using the Nystrom covariance 
estimator, conditioned on a set of k patch vectors chosen 
uniformly at random. Once the k principal components and 
the estimated projection were computed, the denoised patches 
were superimposed to reconstruct an estimate of the original 
image. 

As a benchmark, we provide results from two other patch- 
based denoising methods: the K-SVD algorithm of |30| and 
the BM3D algorithm of The first algorithm performs 

denoising based on a trained "dictionary" of components 
obtained from the noisy image, while the second jointly filters 
groups of image patches by arranging them into 3-D arrays. 
For both algorithms, denoising was performed using code from 
the authors' respective websites, with most parameters set to 
their default values. The only parameters we adjusted were the 
dictionary size and maximum number of training blocks for 
the K-SVD algorithm; to accommodate the high resolution of 
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Table I 

Average PSNR (in dB) of denoising algorithms for 
high-resolution test images 



Image 


cr/PSNR 


PCA 


Nystrom 


K-SVD 


BM3D 




in/ Tfi 1 1, 

iU / Zo. ID 


'XI Q 1 
J / .O 1 


JO.OU 






jiower 


ZU / ZZ. i 1 


JZ. J 1 




jZ. io 






50/14.15 


24.50 


26.69 


28.73 


36.06 




10/28.13 


27.99 


27.26 


31.39 


35.39 


leaves 


20/22.11 


27.00 


26.15 


27.56 


31.61 




50/14.15 


22.92 


22.88 


23.16 


26.78 




10/28.13 


37.18 


37.54 


32.58 


42.51 


hdr 


20/22.11 


32.08 


33.42 


29.13 


39.20 




50/14.15 


24.44 


26.55 


25.42 


34.76 




10/28.13 


33.47 


33.47 


30.51 


34.19 


deer 


20/22.11 


30.58 


31.46 


26.67 


33.33 




50/14.15 


24.17 


26.05 


22.57 


31.98 



test images, these values were increased to 1024 and 130,000, 
respectively. 

Results for all four algorithms are listed in Table |l] For 
each test image, algorithm, and noise level, we list the average 
empirical PSNR over 10 realizations of each noisy image. We 
see that the performance of the PCA and Nystrom denoisers 
is comparable to the benchmark algorithms, with average 
PSNR values typically falling somewhere between those of 
the K-SVD and BM3D algorithms. These results suggest that 
subspace-based denoising provides a reasonable venue for 
testing the capabilities of the Nystrom covariance estimator. 

In comparing the PCA and Nystrom denoisers, we find that 
despite requiring significantly less computation, the Nystrom 
approach actually performs slightly better in most cases. One 
explanation for this behavior is that the shrinkage performed 
by the Nystrom covariance estimator allows for improved 
subspace estimation. 

To better illustrate the computational differences between 
the algorithms, in Figure |5] we show the average run time of 
each method when applied to a noisy version of the hdr image 
with a = 20. Results are shown for the original 2048 x 3072 
image, as well as for resampled versions with sizes of 64 x 96, 
128 X 192, 256 x 384, 512 x 768, and 1024 x 1536. We see 
that the costs of the PCA, Nystrom and BM3D denoisers 
scale similarly with image size, with the Nystrom denoiser 
performing about 2-3 times faster than the PCA denoiser and 
3-9 times faster than BM3D. Note that due to the complexity 
of its training step, the cost of K-SVD is high (though 
relatively constant) across all image sizes. 

In Figure [6j we show 256 x 256 close-ups of denoising 
results for the test image hdr, given a noise level of cr = 20. 
The Nystrom-denoised image is visibly smoother and contains 
fewer artifacts than does the PCA-denoised image, as there is 
less mismatch between its estimated subspace and the "true" 
subspace represented by the clean image data. This difference 
is reflected in the PSNR values for each method. 

VII. Summary 

In this article, we developed the Nystrom approximation 
as a low-rank covariance estimator In addition to deriving 
expressions for its bias and mean squared error for the case 
of normally-distributed data, we showed that the Nystrom 




Figure 5. Average running times in seconds of various denoising algorithms 
when applied to different sizes of hdr image. 

covariance estimator shrinks the sample eigenvalues. This 
shrinkage allows the estimator to achieve performance that is 
comparable to (or at times, better than) the sample covariance, 
particularly in cases where the number of samples is less than 
the dimension of the data. Moreover, because of the com- 
putational advantages of the Nystrom covariance estimator, its 
eigenvalues and eigenvectors can be computed for far less than 
the cost of spectral analysis of the sample covariance. 

We illustrated the potential of the Nystrom covariance 
estimator through its use in two example applications: array 
signal processing and image denoising. In the first example, 
we adapted a projection-based beamforming algorithm to uti- 
lize the Nystrom estimator, resulting in reduced computation 
with little degradation in our ability to recover the desired 
signal. In the second example, we developed a simple PCA- 
based algorithm for image denoising, and then showed how 
the Nystrom covariance estimator could be used to reduce 
computation while maintaining or even improving denoising 
performance. 
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